The primary objective of this project is to use data mining techniques to extract valuable insights from sales data. By examining patterns, trends, and relationships within the data, the project aims to identify opportunities for optimizing sales strategies and enhancing overall performance. Through this analysis, we seek to give decision-makers actionable insights that can drive business decisions.
Dataset used: https://www.kaggle.com/datasets/ahmedabbas757/dataset/data
The first step is to clean the data.
- These are all the imports we need
In [132]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import numpy as np
from sklearn_extra.cluster import KMedoids
from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
In [133]:
df = pd.read_csv('data_sales.csv')
print("Info about the data:")
print(df.info())
Info about the data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9641 entries, 0 to 9640
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Retailer          9641 non-null   object
 1   Retailer ID       9641 non-null   int64
 2   Invoice Date      9641 non-null   object
 3   Region            9641 non-null   object
 4   State             9641 non-null   object
 5   City              9641 non-null   object
 6   Product           9641 non-null   object
 7   Price per Unit    9639 non-null   object
 8   Units Sold        9641 non-null   object
 9   Total Sales       9641 non-null   object
 10  Operating Profit  9641 non-null   object
 11  Sales Method      9641 non-null   object
dtypes: int64(1), object(11)
memory usage: 904.0+ KB
None
Next, we count the missing values and the duplicate rows.
In [134]:
null_count = df.isnull().sum().sum()
duplicate_count = df.duplicated().sum()
print("Number of null rows:", null_count)
print("Number of duplicate rows:", duplicate_count)
Number of null rows: 2
Number of duplicate rows: 0
The next cell is meant to drop these rows.
In [135]:
cleand_dataset = df.dropna()
cleand_dataset = df.drop_duplicates()
print("Info about the cleaned data:")
print(cleand_dataset.info())
df==cleand_dataset
Info about the cleaned data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9641 entries, 0 to 9640
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Retailer          9641 non-null   object
 1   Retailer ID       9641 non-null   int64
 2   Invoice Date      9641 non-null   object
 3   Region            9641 non-null   object
 4   State             9641 non-null   object
 5   City              9641 non-null   object
 6   Product           9641 non-null   object
 7   Price per Unit    9639 non-null   object
 8   Units Sold        9641 non-null   object
 9   Total Sales       9641 non-null   object
 10  Operating Profit  9641 non-null   object
 11  Sales Method      9641 non-null   object
dtypes: int64(1), object(11)
memory usage: 904.0+ KB
None
Out[135]:
| | Retailer | Retailer ID | Invoice Date | Region | State | City | Product | Price per Unit | Units Sold | Total Sales | Operating Profit | Sales Method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | True | True | True | True | True | True | True | True | True | True | True |
| 1 | True | True | True | True | True | True | True | True | True | True | True | True |
| 2 | True | True | True | True | True | True | True | True | True | True | True | True |
| 3 | True | True | True | True | True | True | True | True | True | True | True | True |
| 4 | True | True | True | True | True | True | True | True | True | True | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9636 | True | True | True | True | True | True | True | True | True | True | True | True |
| 9637 | True | True | True | True | True | True | True | True | True | True | True | True |
| 9638 | True | True | True | True | True | True | True | True | True | True | True | True |
| 9639 | True | True | True | True | True | True | True | True | True | True | True | True |
| 9640 | True | True | True | True | True | True | True | True | True | True | True | True |
9641 rows × 12 columns
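In the cell above, `cleand_dataset` is assigned from `df` twice, so the `dropna()` result is overwritten; this is why the printed info still shows 9639 non-null prices and the comparison with `df` is all `True`. A minimal sketch of chaining the two steps, using a hypothetical toy frame rather than the real dataset (the later cells keep working on `df`, where the missing values are eventually filled with the column mean):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value and one duplicated row
toy = pd.DataFrame({
    "Product": ["A", "B", "B", "C"],
    "Price per Unit": [10.0, 20.0, 20.0, np.nan],
})

# Chain the two steps so the second operates on the result of the first
cleaned = toy.dropna().drop_duplicates()
print(len(toy), len(cleaned))
```

Whether to drop or fill missing prices is a design choice; dropping is shown here only because the surrounding text describes it.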
Next, we convert the column data types, because the sales columns were read as objects (strings) rather than floats.
In [136]:
float_columns = ['Price per Unit', 'Units Sold', 'Total Sales', 'Operating Profit']
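for col in float_columns:
    # regex=False treats '$' as a literal character, not an end-of-string anchor
    df[col] = df[col].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)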
df['Invoice Date'] = pd.to_datetime(df['Invoice Date'], errors='coerce')
print("Data types after conversion:")
print(df.dtypes)
Data types after conversion:
Retailer                    object
Retailer ID                  int64
Invoice Date        datetime64[ns]
Region                      object
State                       object
City                        object
Product                     object
Price per Unit             float64
Units Sold                 float64
Total Sales                float64
Operating Profit           float64
Sales Method                object
dtype: object
Next, we handle the outliers so that the analysis reflects typical values.
We will use the interquartile range (IQR) to treat the outliers.
Some information about the IQR method
When using the IQR method for outlier detection and treatment, you typically have two main options: deleting the rows containing outliers or replacing the outliers with more typical values.
Here we use the replacing option: values beyond the whiskers are capped at the whisker limits.
In [137]:
plt.figure(figsize=(15,5))
sns.set(style="whitegrid")
for i, column in enumerate(float_columns):
    plt.subplot(1, len(float_columns), i+1)
    sns.boxplot(data=df, x=column)
    plt.title(column)
plt.tight_layout()
plt.show()
In [138]:
for col in float_columns:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    whisker_width = 1.5
    lower_whisker = q1 - (whisker_width * iqr)
    upper_whisker = q3 + whisker_width * iqr
    df[col] = np.where(df[col] > upper_whisker, upper_whisker, np.where(df[col] < lower_whisker, lower_whisker, df[col]))
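The capping step above can also be sketched with `Series.clip`, which bounds values at a lower and an upper limit. The helper name `cap_outliers_iqr` and the toy series are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, whisker_width: float = 1.5) -> pd.Series:
    """Cap values outside the IQR whiskers at the whisker limits."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker_width * iqr
    upper = q3 + whisker_width * iqr
    # clip() replaces values below `lower` or above `upper` with the limit itself
    return series.clip(lower=lower, upper=upper)

# Hypothetical values: five typical points and one extreme point
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])
capped = cap_outliers_iqr(toy)
print(capped.tolist())
```

Only the extreme point is changed; the typical values pass through untouched, which is the behaviour the nested `np.where` expresses.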
In [139]:
plt.figure(figsize=(15,5))
sns.set(style="whitegrid")
for i, column in enumerate(float_columns):
    plt.subplot(1, len(float_columns), i+1)
    sns.boxplot(data=df, x=column)
    plt.title(column)
plt.tight_layout()
plt.show()
In [140]:
df['Product'] = df['Product'].replace("Men's aparel", "Men's Apparel")
# .copy() avoids a SettingWithCopyWarning on the assignments below
df = df[df['Units Sold'] != 0].copy()
df['Total Sales']=df['Price per Unit'] * df['Units Sold']
df['profit_percentage'] = (df['Operating Profit'] / df['Total Sales']) * 100
df["profit_percentage"] = df['profit_percentage'].astype('float').round()
df['Operating Profit'] = df['Total Sales'] * (df['profit_percentage'] / 100)
df.drop(columns = ['profit_percentage'], inplace = True)
df[float_columns] = df[float_columns].fillna(df[float_columns].mean())
df.to_csv('modified_data_sales.csv', index=False)
The cell above performs the following steps:
- Replaces the value "Men's aparel" in the "Product" column with "Men's Apparel".
- Removes rows where "Units Sold" is equal to zero.
- Recalculates "Total Sales" by multiplying "Price per Unit" by "Units Sold".
- Calculates the profit percentage from "Operating Profit" and "Total Sales" and rounds it to the nearest integer.
- Recalculates "Operating Profit" from the rounded profit percentage and stores the result in the "Operating Profit" column.
- Fills missing values in the numerical columns with the column mean.
- Saves the result to modified_data_sales.csv.
Exploratory Data Analysis (EDA)
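To make the profit recalculation concrete, here is the arithmetic for a single row. The three input values are hypothetical and chosen only so the rounding is visible.

```python
# Hypothetical single row, used only to illustrate the arithmetic
price_per_unit = 50.0
units_sold = 200.0
operating_profit = 3456.0  # assumed raw value before the adjustment

# Total Sales is rebuilt from price and units
total_sales = price_per_unit * units_sold

# Profit percentage is rounded to the nearest integer (34.56 becomes 35)
profit_percentage = round(operating_profit / total_sales * 100)

# Operating Profit is then rebuilt from the rounded percentage (about 3500)
adjusted_profit = total_sales * (profit_percentage / 100)

print(total_sales, profit_percentage, adjusted_profit)
```

The adjusted profit differs slightly from the raw value because of the rounding, so this step trades a little precision for whole-number margins.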
In [141]:
# Display the first rows
print(df.head())
# Information about the data
print(df.info())
# Statistical summary
print(df.describe())
Retailer Retailer ID Invoice Date Region State City \
0 Walmart 1128299 2021-06-17 Southeast Florida Orlando
1 West Gear 1128299 2021-07-16 South Louisiana New Orleans
2 Sports Direct 1197831 2021-08-25 South Alabama Birmingham
3 Sports Direct 1197831 2021-08-27 South Alabama Birmingham
4 Sports Direct 1197831 2021-08-21 South Alabama Birmingham
Product Price per Unit Units Sold Total Sales \
0 Women's Apparel 85.0 218.0 18530.0
1 Women's Apparel 85.0 163.0 13855.0
2 Men's Street Footwear 10.0 700.0 7000.0
3 Women's Street Footwear 15.0 575.0 8625.0
4 Women's Street Footwear 15.0 475.0 7125.0
Operating Profit Sales Method
0 1297.10 Online
1 831.30 Online
2 3150.00 Outlet
3 3881.25 Outlet
4 3206.25 Outlet
<class 'pandas.core.frame.DataFrame'>
Index: 9637 entries, 0 to 9640
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Retailer 9637 non-null object
1 Retailer ID 9637 non-null int64
2 Invoice Date 9637 non-null datetime64[ns]
3 Region 9637 non-null object
4 State 9637 non-null object
5 City 9637 non-null object
6 Product 9637 non-null object
7 Price per Unit 9637 non-null float64
8 Units Sold 9637 non-null float64
9 Total Sales 9637 non-null float64
10 Operating Profit 9637 non-null float64
11 Sales Method 9637 non-null object
dtypes: datetime64[ns](1), float64(4), int64(1), object(6)
memory usage: 978.8+ KB
None
Retailer ID Invoice Date Price per Unit \
count 9.637000e+03 9637 9637.000000
mean 1.173846e+06 2021-05-10 16:52:11.929023488 45.145719
min 1.128299e+06 2020-01-01 00:00:00 7.000000
25% 1.185732e+06 2021-02-17 00:00:00 35.000000
50% 1.185732e+06 2021-06-04 00:00:00 45.000000
75% 1.185732e+06 2021-09-16 00:00:00 55.000000
max 1.197831e+06 2021-12-31 00:00:00 85.000000
std 2.636304e+04 NaN 14.473482
Units Sold Total Sales Operating Profit
count 9637.000000 9637.000000 9637.000000
mean 250.025734 12037.611520 3029.362764
min 6.000000 160.000000 8.000000
25% 106.000000 4068.000000 191.520000
50% 176.000000 7805.000000 440.000000
75% 350.000000 15750.000000 5200.000000
max 716.000000 60860.000000 12888.000000
std 194.848704 11495.128247 4160.986010
Pair plot to visualize the pairwise relationships between Total Sales, Operating Profit, and Price per Unit
In [78]:
warnings.filterwarnings("ignore", category=FutureWarning)
sns.pairplot(df[['Total Sales', 'Operating Profit', 'Price per Unit']])
plt.show()
The distribution of sales is skewed to the right, indicating that there were more instances of lower sales figures than higher sales figures.
Next, a correlation matrix visualizes the correlation between the numeric columns.
In [79]:
selected_columns = ['Price per Unit', 'Units Sold', 'Total Sales','Operating Profit']
new_data = df[selected_columns].copy()
In [80]:
correlation_matrix = new_data.corr()
plt.figure(figsize=(6, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Total Sales and Price per Unit are the most strongly correlated pair, although the correlation is still low.
Next, we visualize the distribution of units sold.
In [81]:
df['Units Sold'].unique()[-50:]
df['Units Sold'] = df['Units Sold'].astype('int')
sns.histplot(data = df, x = "Units Sold", kde=True)
plt.show()
This plot shows the top 20 states by number of sales records.
In [82]:
plt.figure(figsize = (15,6))
graph = sns.countplot(x = "State", data = df, order = df.State.value_counts()[:20].index, palette = "RdBu")
for container in graph.containers:
    graph.bar_label(container)
plt.xticks(rotation = 45)
plt.show()
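Note that `sns.countplot` counts rows per state, not sales value. If the goal is to rank states by revenue, one option is to sum "Total Sales" per state instead. The sketch below uses a hypothetical toy frame with the same column names to show that the two rankings can differ.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    "State": ["Texas", "Texas", "Texas", "Ohio", "Ohio", "Utah"],
    "Total Sales": [100.0, 150.0, 50.0, 400.0, 300.0, 900.0],
})

# Ranking by number of records (what countplot shows)
records_per_state = toy["State"].value_counts()

# Ranking by summed sales value
sales_per_state = (
    toy.groupby("State")["Total Sales"].sum().sort_values(ascending=False)
)

print(records_per_state.index[0])
print(sales_per_state.index[0])
```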
In [83]:
# Select the features to cluster on
selected_columns = ['Total Sales', 'Operating Profit', 'Price per Unit']
data = df[selected_columns]
# Convert the data to a numpy array
data_a = np.array(data)
def optimal_k(data, max_clusters=4):
    inertias = []
    for i in range(1, max_clusters + 1):
        kmedoids = KMedoids(n_clusters=i, random_state=0).fit(data)
        inertias.append(kmedoids.inertia_)
    # Plot over the same range of k that was fitted
    plt.plot(range(1, max_clusters + 1), inertias, marker='o')
    plt.xlabel('Number of clusters')
    plt.ylabel('Inertia')
    plt.title('Elbow Method')
    plt.show()
    # Inertia usually keeps falling as k grows, so the minimum tends to be max_clusters
    return inertias.index(min(inertias)) + 1
best = optimal_k(data_a)
print(best)
# Define the number of clusters
k = best
# Perform k-medoids clustering
kmedoids = KMedoids(n_clusters=k).fit(data_a)
clusters = kmedoids.cluster_centers_
labels = kmedoids.labels_
print("Labels: ", labels, "\n")
print("Cluster Centers: ", clusters, "\n")
4
Labels: [1 0 0 ... 0 0 0]
Cluster Centers: [[9.0090e+03 7.2072e+02 6.3000e+01]
 [1.9500e+04 6.8250e+03 6.0000e+01]
 [3.5800e+04 1.1814e+04 5.0000e+01]
 [3.4680e+03 2.0808e+02 3.4000e+01]]
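The three features are on very different scales (Total Sales in the thousands, Price per Unit below 100), so Euclidean distances are dominated by Total Sales. The notebook clusters the raw values; standardizing each column first is an optional alternative, sketched here on hypothetical rows with plain NumPy.

```python
import numpy as np

# Hypothetical rows: Total Sales, Operating Profit, Price per Unit
features = np.array([
    [9000.0, 700.0, 63.0],
    [19500.0, 6800.0, 60.0],
    [35800.0, 11800.0, 50.0],
    [3500.0, 200.0, 34.0],
])

# z-score each column: subtract the column mean, divide by the column standard deviation
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes comparably to the distance. Cluster assignments on scaled data may differ from the ones reported in this notebook.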
In [84]:
plt.figure(figsize=(8, 6))
for j in range(k):
    cluster_points = data_a[labels == j]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], label=f'Cluster {j}')
plt.scatter(clusters[:, 0], clusters[:, 1], c='black', marker='x', s=100, label='Cluster Centers')
plt.xlabel(selected_columns[0])
plt.ylabel(selected_columns[1])
plt.title('K-medoids Clustering')
plt.legend()
plt.grid(True)
plt.show()
In [ ]:
X = df[['Price per Unit', 'Total Sales']]
threshold = 1 # Set the distance threshold to determine clusters
z2 = linkage(X, method='single', metric='euclidean')
clusters = fcluster(z2, t=threshold, criterion='distance')
plt.figure(figsize=(10, 6))
plt.scatter(X['Price per Unit'], X['Total Sales'], c=clusters, cmap='viridis')
plt.xlabel('Price per Unit')
plt.ylabel('Total Sales')
plt.title('Scatter Plot of Hierarchical Clustering')
plt.show()
In [131]:
num_clusters = len(set(fcluster(z2, t=1, criterion='distance')))
print("Number of clusters:", num_clusters)
plt.figure(figsize=(10, 6))
dendrogram(z2)
plt.title('Dendrogram of Hierarchical Clustering')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
Number of clusters: 8
In [87]:
# Loop through each cluster
for j in range(k):
    # Filter data points belonging to the cluster
    cluster_data = data_a[labels == j]
    fig, axes = plt.subplots(len(selected_columns), 1, figsize=(10, 12))  # Set figure size
    # Plot histograms for each feature on separate subplots
    for i, col in enumerate(selected_columns):
        axes[i].hist(cluster_data[:, i])
        axes[i].set_title(f"Distribution of {col} in Cluster {j}")
        axes[i].set_xlabel(col)
        axes[i].set_ylabel("Frequency")
    fig.suptitle(f"Distribution of Features in Cluster {j}")
    plt.tight_layout()
    plt.show()
In [88]:
max_sales_by_retailer = df.groupby('Retailer')['Total Sales'].max()
# Print the maximum sales for each retailer
print(max_sales_by_retailer)
Retailer
Amazon           53700.0
Foot Locker      60860.0
Kohl's           47250.0
Sports Direct    53700.0
Walmart          60860.0
West Gear        60860.0
Name: Total Sales, dtype: float64
In [89]:
total_sales_by_retailer = df.groupby('Retailer')['Total Sales'].sum()
# Print the total sales for each retailer
print(total_sales_by_retailer)
Retailer
Amazon           1.004179e+07
Foot Locker      2.763377e+07
Kohl's           1.333598e+07
Sports Direct    2.389743e+07
Walmart          9.721602e+06
West Gear        3.137590e+07
Name: Total Sales, dtype: float64
In [90]:
max_retailer = total_sales_by_retailer.idxmax()
# Print the retailer with the highest total sales
print(max_retailer)
West Gear
West Gear has the highest total sales of any retailer.
In [91]:
max_sales_by_region = df.groupby('Region')['Total Sales'].max()
# Print the maximum sales for each region
print(max_sales_by_region)
Region
Midwest      53700.0
Northeast    50120.0
South        60860.0
Southeast    60860.0
West         60860.0
Name: Total Sales, dtype: float64
In [92]:
total_sales_by_region = df.groupby('Region')['Total Sales'].sum()
# Print the total sales for each region
print(total_sales_by_region)
Region
Midwest      1.655985e+07
Northeast    2.393010e+07
South        1.982409e+07
Southeast    2.007722e+07
West         3.561519e+07
Name: Total Sales, dtype: float64
In [93]:
max_region = total_sales_by_region.idxmax()
# Print the region with the highest total sales
print(max_region)
West
The West region accounts for the largest share of total sales.
In [94]:
df['Month'] = df['Invoice Date'].dt.month
df['Month']
Out[94]:
0 6
1 7
2 8
3 8
4 8
..
9636 11
9637 10
9638 10
9639 4
9640 10
Name: Month, Length: 9637, dtype: int32
In [95]:
def find_seasons(monthNumber):
    if monthNumber in [12, 1, 2]:
        return 'Winter'
    elif monthNumber in [3, 4, 5]:
        return 'Spring'
    elif monthNumber in [6, 7, 8]:
        return 'Summer'
    elif monthNumber in [9, 10, 11]:
        return 'Autumn'
df['Season'] = df['Month'].apply(find_seasons)
df['Season']
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.month_name()
In [96]:
max_sales_by_Season = df.groupby('Season')['Total Sales'].max()
# Print the maximum sales for each season
print(max_sales_by_Season)
Season
Autumn    53700.0
Spring    53125.0
Summer    60860.0
Winter    60860.0
Name: Total Sales, dtype: float64
In [97]:
total_sales_by_Season = df.groupby('Season')['Total Sales'].sum()
# Print the total sales for each season
print(total_sales_by_Season)
Season
Autumn    2.725748e+07
Spring    2.705445e+07
Summer    3.330763e+07
Winter    2.838690e+07
Name: Total Sales, dtype: float64
In [98]:
max_Season = total_sales_by_Season.idxmax()
# Print the season with the highest total sales
print(max_Season)
Summer
In [99]:
def groupData(columnName):
    # Use the string 'sum' for both columns; passing the builtin sum is deprecated in recent pandas
    return df.groupby(columnName).agg({'Total Sales': 'sum', 'Operating Profit': 'sum'})
In [100]:
SeasonSales = groupData('Season').sort_values(by = 'Total Sales', ascending = False)
# set size to plot
plt.figure(figsize = (15,6))
# create plot of Total Sales
plt.subplot(1, 2, 1)
sns.lineplot(x = SeasonSales.index, y = "Total Sales", data = SeasonSales, marker = "o")
# Create plot of Operating Profit
plt.subplot(1, 2, 2)
sns.lineplot(x = SeasonSales.index, y = "Operating Profit", data = SeasonSales, marker='o')
plt.show()
Sales are noticeably higher in summer: about 33.3 million, versus 27.1 to 28.4 million in the other seasons.
In [101]:
max_sales_by_salesmethod = df.groupby('Sales Method')['Total Sales'].max()
# Print the maximum sales for each sales method
print(max_sales_by_salesmethod)
Sales Method
In-store    60860.0
Online      60860.0
Outlet      60860.0
Name: Total Sales, dtype: float64
In [102]:
total_sales_by_salesmethod = df.groupby('Sales Method')['Total Sales'].sum()
# Print the total sales for each sales method
print(total_sales_by_salesmethod)
Sales Method
In-store    3.450676e+07
Online      4.407039e+07
Outlet      3.742931e+07
Name: Total Sales, dtype: float64
In [103]:
max_salesmethod = total_sales_by_salesmethod.idxmax()
# Print the sales method with the highest total sales
print(max_salesmethod)
Online
In [104]:
SalesMethod = df.groupby('Sales Method')['Total Sales'].sum().sort_values(ascending = False)
# set size to plot
plt.figure(figsize = (8,4))
# create plot of Total Sales
sns.lineplot(x = SalesMethod.index, y = SalesMethod.values, data = SalesMethod, marker = "o")
plt.show()
Online is the highest-grossing sales method: about 44.1 million, versus 37.4 million for Outlet and 34.5 million for In-store.
In [105]:
df['Year'] = df['Invoice Date'].dt.year
df['Year']
Out[105]:
0 2021
1 2021
2 2021
3 2021
4 2021
...
9636 2021
9637 2021
9638 2021
9639 2021
9640 2021
Name: Year, Length: 9637, dtype: int32
In [106]:
sales_by_month = df.groupby(['Year','Month'])['Total Sales'].sum().reset_index()
# create plot
plt.figure(figsize = (12,6))
sns.lineplot(x = "Month", y = "Total Sales", hue = "Year", data = sales_by_month, marker='o')
plt.show()
Sales increased significantly in 2021, which we attribute to the COVID-19 pandemic.
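Because the "Month" column was converted to month names earlier, grouping on it sorts the months alphabetically rather than chronologically. One way to restore calendar order, sketched on hypothetical monthly totals, is an ordered categorical built from the standard library's `calendar.month_name`.

```python
import calendar
import pandas as pd

# Hypothetical monthly totals, keyed by month name as in the 'Month' column
toy = pd.DataFrame({
    "Month": ["April", "February", "January", "March"],
    "Total Sales": [40.0, 20.0, 10.0, 30.0],
})

# calendar.month_name[0] is an empty string, so skip it
month_order = list(calendar.month_name)[1:]

# An ordered categorical sorts by calendar position instead of alphabetically
toy["Month"] = pd.Categorical(toy["Month"], categories=month_order, ordered=True)
toy = toy.sort_values("Month").reset_index(drop=True)

print(toy["Month"].tolist())
```

This assumes the default English month names; under a different locale `calendar.month_name` returns localized names.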
In [107]:
# Count how many points fall in each cluster
cluster_counts = np.bincount(labels, minlength=k)
In [108]:
for j in range(k):
    print("Cluster ", j, " Count: ", cluster_counts[j],"points")
Cluster 0 Count: 3056 points
Cluster 1 Count: 1508 points
Cluster 2 Count: 1134 points
Cluster 3 Count: 3939 points
In [112]:
from sklearn.metrics import pairwise_distances
for j in range(k):
    cluster_points = data_a[labels == j]
    distances = pairwise_distances(cluster_points, metric='euclidean')
    avg_distance = np.mean(distances)
    print("Cluster", j, "Average Distance:", avg_distance)
Cluster 0 Average Distance: 3418.5170503475556
Cluster 1 Average Distance: 5701.029289463491
Cluster 2 Average Distance: 9111.804274489772
Cluster 3 Average Distance: 1869.1340108374193
In [113]:
clustered_data = pd.DataFrame(data, columns=selected_columns)
clustered_data['Cluster'] = labels
cluster_stats = clustered_data.groupby('Cluster').agg(['mean', 'median', 'std', 'min', 'max', 'count'])
print(cluster_stats)
Total Sales \
mean median std min max count
Cluster
0 9535.300466 9076.5 2348.673728 6125.0 16065.0 3056
1 19979.690981 19500.0 3914.688581 12500.0 28125.0 1508
2 37825.542328 35800.0 7834.021913 27000.0 60860.0 1134
3 3514.355166 3500.0 1533.750641 160.0 6256.0 3939
Operating Profit \
mean median std min max count
Cluster
0 1420.158467 513.12 1535.105625 103.04 7393.75 3056
1 6973.750749 7000.00 2548.591213 489.60 12801.25 1508
2 11673.026808 12565.80 1690.530849 4462.50 12888.00 1134
3 279.344034 162.06 427.941554 8.00 3000.00 3939
Price per Unit
mean median std min max count
Cluster
0 47.758603 47.0 12.754951 10.0 85.0 3056
1 49.410477 50.0 12.419693 20.0 85.0 1508
2 60.568783 60.0 11.835446 40.0 85.0 1134
3 37.045697 37.0 11.859960 7.0 73.0 3939
In [114]:
from sklearn.metrics import pairwise_distances
centroids = kmedoids.cluster_centers_
centroid_distances = pairwise_distances(centroids)
cluster_distances = []
for cluster in np.unique(labels):
    cluster_indices = np.where(labels == cluster)[0]
    cluster_data = data_a[cluster_indices]
    # Within-cluster scatter, measured here as the mean pairwise distance
    cluster_distances.append(np.mean(pairwise_distances(cluster_data)))
davies_bouldin_scores = []
for i in range(k):
    db_score = 0
    for j in range(k):
        if i != j:
            db_score += (cluster_distances[i] + cluster_distances[j]) / centroid_distances[i, j]
    # Averages the ratios over j; the standard index takes the maximum instead
    db_score /= (k - 1)
    davies_bouldin_scores.append(db_score)
avg_db_index = np.mean(davies_bouldin_scores)
print(f"\nAverage Davies-Bouldin Index across clusters: {avg_db_index:.4f}")
Average Davies-Bouldin Index across clusters: 0.6265
A lower Davies-Bouldin Index indicates better clustering.
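For reference, the standard Davies-Bouldin index is usually written as follows, where $s_i$ is the average distance from the points of cluster $i$ to its center and $d_{ij}$ is the distance between the centers of clusters $i$ and $j$:

```latex
\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}}
```

The cell above measures within-cluster scatter by mean pairwise distance and averages the ratios over $j$ rather than taking the maximum, so its value is a variant of this index and may not match `sklearn.metrics.davies_bouldin_score`.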
In [115]:
from sklearn.metrics import silhouette_samples
import numpy as np
# Calculate Silhouette Coefficient for each sample (data point)
sample_silhouette_values = silhouette_samples(data, kmedoids.labels_)
# Assign Silhouette Coefficient values to each data point in the DataFrame
clustered_data['Silhouette Coefficient'] = sample_silhouette_values
# Print the Silhouette Coefficient for each cluster
print("Silhouette Coefficient for each cluster:")
for cluster in np.unique(kmedoids.labels_):
    cluster_indices = np.where(kmedoids.labels_ == cluster)[0]
    silhouette_cluster = np.mean(sample_silhouette_values[cluster_indices])
    print(f"Cluster {cluster}: Silhouette Coefficient = {silhouette_cluster:.4f}")
Silhouette Coefficient for each cluster:
Cluster 0: Silhouette Coefficient = 0.3784
Cluster 1: Silhouette Coefficient = 0.4322
Cluster 2: Silhouette Coefficient = 0.4574
Cluster 3: Silhouette Coefficient = 0.6804
A higher silhouette coefficient indicates better clustering, so cluster 3 (0.6804) is better defined than the other clusters.
In [116]:
# Silhouette score for hierarchical clustering
from sklearn.metrics import silhouette_score
silhouette_avg = silhouette_score(X, clusters)
print(f"Silhouette Score: {silhouette_avg:.4f}")
Silhouette Score: 0.7092
In [117]:
# Silhouette score for k-medoids
silhouette_avg = silhouette_score(data_a, labels)
print("The average silhouette_score is :", silhouette_avg)
The average silhouette_score is : 0.5195491103947028
The silhouette score ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
Hierarchical clustering scores higher (0.7092 versus 0.5195 for k-medoids), so by this metric it performs better in our case.
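To show what the silhouette score measures, here is a small from-scratch sketch with NumPy. For each point, `a` is the mean distance to the other points in its own cluster and `b` is the smallest mean distance to any other cluster; the score is `(b - a) / max(a, b)`. The function name and the toy points are assumptions for illustration; the notebook itself uses `sklearn.metrics.silhouette_score`.

```python
import numpy as np

def silhouette_values(points: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-sample silhouette s = (b - a) / max(a, b) using Euclidean distance."""
    # Pairwise Euclidean distances, shape (n, n)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton cluster: the score is left at 0
        a = dist[i, same].mean()
        b = min(
            dist[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical 1-D points forming two well-separated groups
toy_points = np.array([[0.0], [1.0], [10.0], [11.0]])
toy_labels = np.array([0, 0, 1, 1])
scores = silhouette_values(toy_points, toy_labels)
print(scores.round(3))
```

Because the two toy groups are far apart relative to their internal spread, every score is close to 1.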
In [118]:
from sklearn.metrics import calinski_harabasz_score
# Calculate Calinski-Harabasz Index for k-medoids
calinski_score = calinski_harabasz_score(data_a, labels)
print("The Calinski-Harabasz Index is:", calinski_score)
The Calinski-Harabasz Index is: 29891.74369942133
In [119]:
# Calculate Calinski-Harabasz Index for hierarchical clustering
calinski_score = calinski_harabasz_score(X, clusters)
print("The Calinski-Harabasz Index is:", calinski_score)
The Calinski-Harabasz Index is: 97153242683.56349
The Calinski-Harabasz Index compares the variance within clusters to the variance between clusters. A higher index signifies better separation between clusters.
Hierarchical clustering has a much higher score, so by this index it has much better separation between clusters.
In conclusion, hierarchical clustering produces better-quality clusters by these metrics.
Next, we use kernel density estimation (KDE) to estimate the density distribution of each cluster and identify regions of overlap.
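For reference, the Calinski-Harabasz index is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom (`k - 1` and `n - k`). A minimal from-scratch sketch with NumPy follows; the function name and the toy points are assumptions for illustration, and the notebook itself relies on `sklearn.metrics.calinski_harabasz_score`.

```python
import numpy as np

def calinski_harabasz(points: np.ndarray, labels: np.ndarray) -> float:
    """Between-cluster over within-cluster dispersion, scaled by degrees of freedom."""
    n = len(points)
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    overall_mean = points.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in cluster_ids:
        members = points[labels == c]
        center = members.mean(axis=0)
        # Squared distance of the cluster mean from the overall mean, weighted by size
        between += len(members) * ((center - overall_mean) ** 2).sum()
        # Squared distances of the members from their own cluster mean
        within += ((members - center) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical 1-D points forming two tight, distant groups
toy_points = np.array([[0.0], [1.0], [10.0], [11.0]])
toy_labels = np.array([0, 0, 1, 1])
ch = calinski_harabasz(toy_points, toy_labels)
print(ch)
```

The index grows quickly when clusters are very tight, because the within-cluster term in the denominator approaches zero.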
In [120]:
import numpy as np
from sklearn.neighbors import KernelDensity
def calculate_cluster_densities(data, labels):
    cluster_densities = []
    unique_labels = np.unique(labels)
    for label in unique_labels:
        # Extract data points belonging to the current cluster
        cluster_data = data[labels == label]
        # Fit KDE model for the current cluster
        kde = KernelDensity(bandwidth=0.5, kernel='gaussian')
        kde.fit(cluster_data)
        # Evaluate KDE at each data point
        densities = np.exp(kde.score_samples(cluster_data))
        # Store the density estimates for the current cluster
        cluster_densities.append(densities)
    return cluster_densities
# Example usage
# Assuming 'data' is your data array and 'labels' are the cluster labels
cluster_densities = calculate_cluster_densities(data_a, labels)
# Compute average density for each cluster
avg_cluster_densities = [np.mean(densities) for densities in cluster_densities]
# Print or visualize the average densities to identify regions of overlap
print("Average densities for each cluster:", avg_cluster_densities)
Average densities for each cluster: [0.0004301624048650191, 0.0014603683686582874, 0.008904818071545388, 0.00020793674683601603]
Cluster 2 (using the zero-based cluster labels from above) has the highest average density, about 0.0089, indicating that the data points within this cluster are tightly packed together, forming a high-density region. This suggests that there is likely a distinct and well-defined cluster.
Clusters 0, 1, and 3 have lower average densities than Cluster 2, indicating that the data points within these clusters are less densely packed. This could imply that these clusters overlap more with neighboring clusters or are not as well separated.